
caddytls: Reuse certificate cache through reloads #5623

Merged 5 commits into master on Jul 11, 2023
Conversation

mholt
Member

@mholt mholt commented Jul 8, 2023

Previously, Caddy created a new certificate cache for every config load. This ensured that the state of the certificate cache was consistent with the newly-loaded config, no matter what changes were made.

However, this was a bottleneck for some users with large TLS deployments, some of which exceed 100K sites/certificates: the entire cache was GC'ed, and a new one had to be created by loading each certificate from storage and decoding it, up to the cache capacity (which is configurable to an arbitrarily large size). The allocations alone are not cheap -- certificate encodings are messy!

On-demand TLS can relieve this pressure, since certificates are only loaded into the cache as handshakes come in. But on busy servers, that turns out to happen almost instantly. And it hammers the 'ask' endpoint, which is now required; we have received complaints about this in the beta version.

So, while this is not a very small change, it's not as big as I thought it'd be: use only 1 certificate cache, and persist it through config reloads.

When a config using the TLS app is loaded, the TLS app is provisioned and started; this loads certificates into the cache that aren't already there, along with any manually-loaded certificates. We now keep a list of which subjects/certificates each config loads. This adds slightly more memory overhead, but it's insignificant relative to the actual certificates.

Then, when a TLS app is "cleaned up" (decommissioned in preparation for being garbage-collected), it gets the currently-active Caddy context, which has a reference to the new/current TLS app. Any certificates/subjects that are listed only in the "old" TLS app are removed from the cache.
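
The cleanup step described above amounts to a set difference between the old and new apps' loaded-certificate lists. A minimal sketch of that logic (function and variable names here are illustrative, not Caddy's actual identifiers):

```go
package main

import "fmt"

// evictOnCleanup returns the cache keys that only the old config loaded,
// i.e. the certificates that are safe to remove from the shared cache
// when the old TLS app is cleaned up. Anything the new config also
// loaded stays in the cache.
func evictOnCleanup(oldLoaded, newLoaded []string) []string {
	stillUsed := make(map[string]struct{}, len(newLoaded))
	for _, k := range newLoaded {
		stillUsed[k] = struct{}{}
	}
	var evict []string
	for _, k := range oldLoaded {
		if _, ok := stillUsed[k]; !ok {
			evict = append(evict, k)
		}
	}
	return evict
}

func main() {
	oldCfg := []string{"a.example.com", "b.example.com", "c.example.com"}
	newCfg := []string{"b.example.com", "c.example.com", "d.example.com"}
	fmt.Println(evictOnCleanup(oldCfg, newCfg)) // prints [a.example.com]
}
```

The per-config subject lists are what make this diff possible; without them, the only safe option is to throw the whole cache away, which is exactly the behavior this PR removes.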

This does have the side effect that managed certificates can't be forcefully reloaded from storage unless the config is completely unloaded first (or the server is stopped) and then reloaded. This is because managed certificates don't get evicted from the cache if the next config is also using them; however, when those certificates come due for renewal, the new config will apply when renewing them.

Manually-loaded certificates are still reloaded from storage at every config reload: because the server does not manage them, config reloads are how you tell the server that you updated the certificate. (This is the "classic" method of using TLS certificates with traditional web servers: update the cert on disk, then reload the server config to apply the change.) So if you are manually loading thousands of certificates, this change won't help you much. But then why are you using Caddy...

Anyway, I still need to finish this and do more testing to ensure it actually works. My hunch is this should drastically lighten config reloads when there are lots of certificates.

@mholt mholt added the optimization 📉 Performance or cost improvements label Jul 8, 2023
@mholt mholt added this to the v2.7.0 milestone Jul 8, 2023
@mholt mholt requested a review from francislavoie July 8, 2023 16:31
@francislavoie
Member

This does have the side-effect that managed certificates can't be forcefully-reloaded from storage unless the config is completely unloaded first (or the server is stopped) and then reloaded.

I wonder if we could parameterize this, i.e. add some optional flags to reloading to change the behaviour. Something for later. It would mean HTTP headers in the API, possibly env vars and/or CLI flags for the reload command, then passing those flags through the context for provision/cleanup/start/stop 🤔 That would let users do a graceful reload with a forced cert reload.

This actually avoids reloading managed certs from storage when already in the cache, d'oh.
@mholt
Member Author

mholt commented Jul 8, 2023

I wonder if we could parameterize this, i.e. adding some optional flags to reloading to change behaviour. Something for later.

We'll see what the need is first. I'm not sure what the use case would be for that kind of requirement at this stage.

PS. Just pushed a commit that updates CertMagic so we don't reload managed certs from storage that are already in the cache 😅

modules/caddytls/tls.go
Comment on lines +354 to +355
certCache.RemoveManaged(noLongerManaged)
certCache.Remove(noLongerLoaded)
Member


These names aren't symmetrical. Call the other one RemoveUnmanaged?

Member Author


True! I did at first, but realized that technically this would remove any certificate by that hash, whether it's managed or not.

In practice, I suppose you'll only have the hash of unmanaged certificates since the managed ones don't expose that (unless you manually compute that yourself)...

I'll think about it!

@mholt
Member Author

mholt commented Jul 8, 2023

Oh, and minor implementation detail: this change uses blake3 to hash certificates, instead of sha256.

In my testing, Blake3 was roughly 50% faster (1723 ns/op vs 2645 ns/op), but with 2 allocations instead of 1:

goos: linux
goarch: amd64
pkg: hashtest
cpu: AMD Ryzen 9 3900X 12-Core Processor            
BenchmarkBlake3-24        769701              1723 ns/op              64 B/op          2 allocs/op
BenchmarkSHA256-24        438304              2645 ns/op              32 B/op          1 allocs/op

(I think 64 B/op is still acceptable though. Blake3 apparently always computes a 64-byte digest but truncates it to 32 bytes by default, and we just use the default.)
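
Since Blake3 isn't in Go's standard library (the real change pulls it in via a third-party package), here is a dependency-free sketch of just the compute-64-bytes-then-truncate-to-32 pattern the comment describes, using crypto/sha512 as a stand-in hash with a 64-byte digest:

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// hashCert illustrates the truncation pattern: compute a 64-byte digest,
// then keep only the first 32 bytes as the cache key. The PR uses BLAKE3
// for this; crypto/sha512 stands in here only so the sketch compiles
// without third-party dependencies.
func hashCert(der []byte) string {
	sum := sha512.Sum512(der)           // 64-byte digest
	return hex.EncodeToString(sum[:32]) // truncated to 32 bytes
}

func main() {
	key := hashCert([]byte("certificate DER bytes"))
	fmt.Println(len(key)) // prints 64 (64 hex chars = 32 bytes)
}
```

The extra allocation in the benchmark above plausibly comes from producing the full 64-byte digest before truncation, which is why the 64 B/op figure appears even though only 32 bytes are kept.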

@mholt mholt marked this pull request as ready for review July 10, 2023 03:56
@mholt mholt linked an issue Jul 11, 2023 that may be closed by this pull request
@mholt
Member Author

mholt commented Jul 11, 2023

@francislavoie I think I'm ready to merge this; want to do any final review or does it LGTY?

Member

@francislavoie francislavoie left a comment


Yep LGTM. Obviously there's still the question of the ActiveContext mutex, but 🤷‍♂️ I haven't dug into it either

@mholt
Copy link
Member Author

mholt commented Jul 11, 2023

Yeah. Tests with the race detector don't raise any errors, so I'm good with trying this for now. We can rejigger the locking later if needed.

Thanks!

@mholt mholt enabled auto-merge (squash) July 11, 2023 19:01
@mholt mholt merged commit 0e2c7e1 into master Jul 11, 2023
22 checks passed
@mholt mholt deleted the single-cache branch July 11, 2023 19:11
Successfully merging this pull request may close these issues.

Certificate cache is flushed when new config is loaded, causing downtime